Use filter to select observations (rows) from a tibble.

In the previous section we learned how to select variables (columns) with all its values using the select function. In this section we are interested in a collection of observations (rows). For this we have the function filter which takes as its first argument the tibble followed by a logical condition describing the values to be filtered on.

Logical conditions

A logical condition is a way to express whether some statement is true or false. For example, let’s say we want to know which individuals in our pulse dataset are taller than 190 cm? You can achieve this with filter :

filter(pulse, height>190) # filter all observations with height greater than 190 cm 
# A tibble: 5 x 13
  id     name        height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
  <chr>  <chr>        <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
1 1993_D Travis         195     84    18 male   no     yes     high     sat       71     73  1993
2 1997_K John           191     78    19 male   no     yes     high     ran       68    136  1997
3 1997_O Albert         194    110    25 male   no     no      moderate sat       75     75  1997
4 1997_Q Lance          192    105    21 male   no     no      moderate sat       80     73  1997
5 1997_R Christopher    194     95    18 male   no     yes     moderate ran       84    140  1997

or another example would be, who’s weight is exactly 55 kg?

filter(pulse, weight==55) # filter all observations with weight equal to 55 kg 
# A tibble: 5 x 13
  id     name      height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
  <chr>  <chr>      <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
1 1993_H Francesca    169     55    18 female no     yes     moderate sat       71     77  1993
2 1993_W Glenna       163     55    20 female no     no      low      sat       78     74  1993
3 1995_J Gwen         163     55    20 female no     no      moderate ran       70    119  1995
4 1998_G Ursula       155     55    20 female no     yes     high     sat       82     87  1998
5 1998_I Eleanor      168     55    24 female no     no      moderate sat       74     70  1998

filter uses logical vectors to achieve this results. You’ve already seen an example of a logical vector in the section on Vectors. The result of a comparison with relational operators {==, !=, <, <=, >, >=} is a logical.

‘==’ stands for ‘equal to’ and ‘!=’ is ‘not equal to’

In the examples above the comparisons height>190 and weight==55 are in fact logical vectors with the same length as variables height and weight in the pulse dataset.

The function filter takes the logical vector and returns only the rows in positions for which the value is TRUE.

Multiple conditions : and (‘&’), or (‘|’)

In the previous examples with filter we only had a single condition. We can expand the conditions to ask more complex questions. For example which males weigh less than 70? Here we have two conditions:

  • gender=="male" : male observations
  • weight<70 : weigh less than 70

And : ‘&’

Both these conditions must be met, with other words for each observation gender=="male" and weight<70 must be true and this is expressed with the symbol & :

filter(pulse, gender=="male" & weight<70)
# A tibble: 24 x 13
   id     name      height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
   <chr>  <chr>      <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1993_J Troy         168     60    23 male   no     yes     moderate ran       88    150  1993
 2 1993_L Frederick    178     58    19 male   no     no      low      sat       74     76  1993
 3 1993_M Justin       170     68    22 male   yes    yes     moderate sat       70     71  1993
 4 1993_N Ernest       187     59    18 male   no     yes     high     sat       78     82  1993
 5 1993_Q Leslie       170     56    19 male   no     no      low      sat       64     63  1993
 6 1993_U Jerome       175     60    19 male   no     no      low      sat       88     86  1993
 7 1993_Y John         176     59    19 male   no     no      moderate sat       68     69  1993
 8 1995_A Pedro        170     60    18 male   no     yes     moderate sat       62     59  1995
 9 1995_C Oscar        189     60    19 male   no     yes     moderate ran       78    168  1995
10 1995_I Marc         175     65    19 male   no     yes     moderate ran       60    104  1995
# … with 14 more rows

Another example: select females with high frequency of exercise:

filter(pulse, gender=="female" & exercise=="high")
# A tibble: 3 x 13
  id     name     height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
  <chr>  <chr>     <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
1 1993_C Consuelo    167     62    18 female no     yes     high     ran       96    176  1993
2 1993_I Sonja       164     56    19 female no     yes     high     sat       68     68  1993
3 1998_G Ursula      155     55    20 female no     yes     high     sat       82     87  1998

Or : ‘|’

The or conditional operator is true when at least one of the conditions is true. For example, list all rows with individuals who drink or smoke or both:

filter(pulse, alcohol=="yes" | smokes=="yes")
# A tibble: 70 x 13
   id     name      height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
   <chr>  <chr>      <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1993_A Bonnie       173     57    18 female no     yes     moderate sat       86     88  1993
 2 1993_B Melanie      179     58    19 female no     yes     moderate ran       82    150  1993
 3 1993_C Consuelo     167     62    18 female no     yes     high     ran       96    176  1993
 4 1993_D Travis       195     84    18 male   no     yes     high     sat       71     73  1993
 5 1993_E Lauri        173     64    18 female no     yes     low      sat       90     88  1993
 6 1993_F George       184     74    22 male   no     yes     low      ran       78    141  1993
 7 1993_G Cherry       162     57    20 female no     yes     moderate sat       68     72  1993
 8 1993_H Francesca    169     55    18 female no     yes     moderate sat       71     77  1993
 9 1993_I Sonja        164     56    19 female no     yes     high     sat       68     68  1993
10 1993_J Troy         168     60    23 male   no     yes     moderate ran       88    150  1993
# … with 60 more rows

There are 70 observations who at least do one of the two: drink or smoke.

Which group is missing from the result above?

Those that neither smoke nor drink.


Another example, filter rows from years 1995 and 1997

filter(pulse, year==1995 | year==1997)
# A tibble: 45 x 13
   id     name   height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
   <chr>  <chr>   <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1995_A Pedro     170     60    18 male   no     yes     moderate sat       62     59  1995
 2 1995_B Olga      172     60    21 female no     no      low      sat       81     79  1995
 3 1995_C Oscar     189     60    19 male   no     yes     moderate ran       78    168  1995
 4 1995_D Kelli     178     56    21 female no     yes     moderate ran       86    150  1995
 5 1995_E Scott     175     75    20 male   no     yes     moderate ran       59     92  1995
 6 1995_F Bobby     180     85    19 male   yes    yes     moderate ran       68    125  1995
 7 1995_G Laurie    160     57    19 female no     no      moderate ran       75    130  1995
 8 1995_H Eliza     164     66    23 female no     no      low      ran       74    168  1995
 9 1995_I Marc      175     65    19 male   no     yes     moderate ran       60    104  1995
10 1995_J Gwen      163     55    20 female no     no      moderate ran       70    119  1995
# … with 35 more rows

Here we have the same variable year in our | construct.

What is wrong with this command: filter(pulse, year==1995 & year==1997) ?

An observation in a variable can not take two values at the same time. Here we have conditions on both sides of ‘&’ on the same varible, in this case year, and since year can not be 1995 and 1997 at the same time for an observation the result of this query is empty.




Copyright © 2021 Biomedical Data Sciences (BDS) | LUMC